ABSTRACT

In this paper, we analyze the efficiency of Monte Carlo methods
for incremental computation of PageRank, personalized
PageRank, and similar random walk based methods (with 
focus on SALSA), on large-scale dynamically evolving social 
networks. We assume that the graph of friendships is stored 
in distributed shared memory, as is the case for large social 
networks such as Twitter. 

For global PageRank, we assume that the social network 
has n nodes, and m adversarially chosen edges arrive in a 
random order. We show that with a reset probability of $\epsilon$,
the total work needed to maintain an accurate estimate (using
the Monte Carlo method) of the PageRank of every node
at all times is $O(n \ln m / \epsilon^2)$. This is significantly better than all
known bounds for incremental PageRank. For instance, if
we naively recompute the PageRanks as each edge arrives,
the simple power iteration method needs $\Omega(m^2 / \ln(1/(1 - \epsilon)))$ total
time and the Monte Carlo method needs $O(mn/\epsilon)$ total
time; both are prohibitively expensive. Furthermore, we
also show that we can handle deletions equally efficiently.

We then study the computation of the top k personal- 
ized PageRanks starting from a seed node, assuming that 
personalized PageRanks follow a power-law with exponent
$\alpha < 1$. We show that if we store $R > q \ln n$ random walks
starting from every node for large enough constant $q$ (using
the approach outlined for global PageRank), then the
expected number of calls made to the distributed social network
database is $O(k / R^{(1-\alpha)/\alpha})$.

We also present experimental results from the social networking
site Twitter, verifying our assumptions and analyses.
The overall result is that this algorithm is fast enough
for real-time queries over a dynamic social network. 

1. INTRODUCTION 
Over the last decade, PageRank [39] has emerged as a very
effective measure of reputation for both web graphs and social
networks (where it was historically known as eigenvector
centrality [15]). Also, collaborative filtering has proved
to be a very effective method for personalized recommendation
systems [10, 32, 37]. In this paper, we will focus on
fast incremental computation of (approximate) PageRank,
personalized PageRank [14, 19, 39], and similar random walk
based methods, particularly SALSA [30] and personalized
SALSA [38, 40], over dynamic social networks, and its applications
to reputation and recommendation systems over
these networks. Incremental computation is useful when 
edges in a graph arrive over time, and it is desirable to up- 
date the PageRank values right away rather than wait for a 
batched computation. 

Surprisingly, despite the fact that computing PageRank 
is a well studied problem [5], some simple assumptions on
the structure of the network and the data layout lead to 
dramatic improvements in running time, using the simple 
Monte Carlo estimation technique. 

In large scale web applications, the underlying graph is 
typically stored on disk, and either edges are streamed [11]
or a map-reduce computation is performed. The Monte 
Carlo method requires random access to the graph, and has 
not found widespread practical use in these applications. 

However, for social networking applications, it is crucial 
to support random access to the underlying network, since 
messages flow on edges in a network in real-time, and ran- 
dom access to the social graph is necessary for the core func- 
tionalities of the network. Hence, the graph is usually stored 
in distributed shared memory, which we denote as "Social 
Store", providing a data access model very similar to Scalable
Hyperlink Store [36]. We use this feature strongly in
obtaining our results. 

In this introduction, we will first provide some background 
on PageRank and SALSA and typical approaches to com- 
puting them, and then outline our results along with some 
basic efficiency comparisons. The literature related to the 
problem studied in this paper is really vast, and in the fi- 
nal subsection of this introduction, we do a comprehensive 
review of this literature, and compare our results with the 
related previous results. 

1.1 Background 

In this paper, we focus on (incremental) computation of
PageRank [39], personalized PageRank [14, 19, 39], SALSA
[30], and personalized SALSA [38, 40]. So, in this subsection,
we provide a quick review of these methods. Here and
throughout the paper, we denote the number of nodes in the
network by $n$, and the number of edges in the network by
$m$.

PageRank is the stationary distribution of a random walk
which, at each step, with a certain probability $\epsilon$ jumps to
a random node, and with probability $1 - \epsilon$ follows a randomly
chosen outgoing edge from the current node. Personalized
PageRank is the same as PageRank, except that
all the jumps are made to the seed node for which we are 
personalizing the PageRanks. 

SALSA, just like HITS [23], associates two scores with
each node $v$, called hub score, $h_v$, and authority score, $a_v$.
These scores represent how good a hub or authority each
node is. The scores are related as follows:

$$h_v = \sum_{\{x \mid (v,x) \in E\}} a_x / \mathrm{indeg}(x)$$

$$a_x = \sum_{\{v \mid (v,x) \in E\}} h_v / \mathrm{outdeg}(v)$$

where $E$ is the set of edges of the graph, and $\mathrm{indeg}(x)$ and
$\mathrm{outdeg}(v)$ are, respectively, the in-degrees and out-degrees
of the nodes. Notice that SALSA corresponds to a forward- 
backward random walk, where the walk alternates between 
forward and backward steps. The personalized version of 
SALSA, that we consider, allows random jumps (to the seed 
node) at forward steps. Thus, personalizing over node u 
corresponds to the following equations: 

$$h_v = \epsilon\,\delta_{u,v} + (1 - \epsilon) \sum_{\{x \mid (v,x) \in E\}} a_x / \mathrm{indeg}(x)$$

$$a_x = \sum_{\{v \mid (v,x) \in E\}} h_v / \mathrm{outdeg}(v)$$
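
As a rough illustration of the personalized equations above, the following Python sketch iterates them directly on a small edge list. The function name personalized_salsa, the fixed iteration count, and the toy graph are assumptions made for this example only; it is a fixed-point iteration, not the walk-based estimation analyzed in this paper, and it assumes the seed node has at least one outgoing edge.

```python
def personalized_salsa(edges, seed, eps=0.2, iters=100):
    """Iterate the personalized hub/authority equations for `seed`.

    edges is a list of directed pairs (v, x). Hypothetical helper for
    illustration; assumes `seed` has at least one outgoing edge.
    """
    nodes = {w for e in edges for w in e}
    outdeg = {w: 0 for w in nodes}
    indeg = {w: 0 for w in nodes}
    for v, x in edges:
        outdeg[v] += 1
        indeg[x] += 1

    hub = {w: (1.0 if w == seed else 0.0) for w in nodes}
    auth = {w: 0.0 for w in nodes}
    for _ in range(iters):
        # a_x = sum over edges (v, x) of h_v / outdeg(v)
        auth = {w: 0.0 for w in nodes}
        for v, x in edges:
            auth[x] += hub[v] / outdeg[v]
        # h_v = eps * delta_{seed,v} + (1 - eps) * sum over (v, x) of a_x / indeg(x)
        new_hub = {w: (eps if w == seed else 0.0) for w in nodes}
        for v, x in edges:
            new_hub[v] += (1.0 - eps) * auth[x] / indeg[x]
        hub = new_hub
    return hub, auth


edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
hub, auth = personalized_salsa(edges, seed=0)
```

Both score vectors keep total mass 1 in this example, because every forward step splits a hub score over out-edges and every backward step splits an authority score over in-edges.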

Notice that in our setting, hub scores and authority scores 
can be interpreted, respectively, as similarity measures and 
relevance measures. Hence, we obtain a simple system for 
recommending additional friends to a user: just recommend 
those with the highest relevance. We will now outline two 
broad approaches to computing PageRank and SALSA; a 
more detailed overview is presented in Section 1.3. The first
approach is to use linear algebraic techniques, the simplest
of which is the power iteration method. In this method, the
PageRank of node $v$ is initialized to $\pi_0(v) = 1/n$ and the
following update is repeatedly performed:

$$\forall v,\quad \pi_{i+1}(v) = \epsilon/n + \sum_{\{w \mid (w,v) \in E\}} \pi_i(w)(1 - \epsilon)/\mathrm{outdeg}(w). \qquad (1)$$

This gives exponential reduction in error per iteration, resulting
in a running time of $O(m)$ per iteration and
$O(m / \ln(1/(1 - \epsilon)))$ for getting the total error down to a constant.
The other broad approach is Monte Carlo, where we directly 
do random walks to estimate the PageRank of each node. In 
the simplest instantiation, we do R "short random walks" 
of geometric lengths (with mean $1/\epsilon$) starting at each node.
Each short random walk simulates one continuous session by
a random surfer who is doing the PageRank random walk.
While the error does not decay as fast as in the power iteration
method, $R = \ln(n/\epsilon)$ or even $R = 1$ give provably
good results, and have running time of $O(nR/\epsilon)$, which, for
non-sparse graphs, is much better than that of power iteration.
SALSA can be computed using obvious modifications
to either approach. 
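
Update (1) can be sketched in a few lines of Python. This is a minimal illustration, assuming an adjacency-list dictionary in which every node has at least one outgoing edge; the name power_iteration, the iteration count, and the toy graph are choices made for the example, not part of the paper.

```python
def power_iteration(out_edges, eps=0.2, iters=50):
    """Approximate PageRank by repeating update (1).

    out_edges maps each node to the list of nodes it points to.
    Assumes every node has at least one outgoing edge.
    """
    nodes = list(out_edges)
    n = len(nodes)
    pi = {v: 1.0 / n for v in nodes}            # pi_0(v) = 1/n
    for _ in range(iters):
        nxt = {v: eps / n for v in nodes}       # reset mass eps/n
        for w in nodes:                         # one pass over all edges: O(m)
            share = (1.0 - eps) * pi[w] / len(out_edges[w])
            for v in out_edges[w]:
                nxt[v] += share
        pi = nxt
    return pi


graph = {0: [1, 2], 1: [2], 2: [0]}
ranks = power_iteration(graph)
```

Each pass touches every edge once, which is where the $O(m)$ cost per iteration comes from.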

1.2 Our Results 

We study two problems. First, efficient incremental com- 
putation of PageRank (and its variants) over dynamically 
evolving networks. Second, efficiently computing personal- 
ized PageRank (and its variants) under the power-law net- 
work model. Both of these problems are considered to be 
among the most important problems regarding PageRank 
[26]. Below, we overview our results in more detail.

We present formal analyses of incremental computation 
of random walk based reputations and collaborative filters 
in the context of evolving social networks. In particular, we 
focus on very efficiently approximating both global and personalized
variants of PageRank [39] and SALSA [30]. We
perform experiments to validate each of the assumptions 
made in our analysis, and also study the empirical perfor- 
mance of our algorithms. 

For global PageRank, we show that in a network with 
n nodes, and m adversarially chosen edges arriving in ran- 
dom order, we can maintain very accurate estimates of the 
PageRank (and authority score) of every node at all times 
with only $O(n \ln m / \epsilon^2)$ total work. This is a dramatic improvement
over the naive running time $\Omega(m^2 / \ln(1/(1 - \epsilon)))$ (e.g., by recomputing
the PageRanks upon arrival of each edge, using
the power iteration method) or $\Omega(mn/\epsilon)$ (e.g., using the Monte
Carlo method from scratch each time an edge arrives). Similarly,
we show that in a network with $m$ edges, upon removal
of a random edge, we can update all the PageRank approximations
using only $O(n/(m\epsilon^2))$ expected work. Our algorithm
is a Monte Carlo method [3] that works by maintaining a
small number of short random walk segments starting at 
each node in the social graph. The same approach works for 
SALSA. For global SALSA, the authority score of a node 
is exactly its in-degree as the reset probability goes to 0, 
so the primary reason to store these random walk segments 
is to aid in computing personalized SALSA scores. It is 
important to note that it is only the efficiency of our algo- 
rithms that depends on the random order assumption; it is 
also important to note that the random order assumption is 
weaker than assuming the well-known generative models for 
power-law networks, such as Preferential Attachment [3]. 

We then study the problem of finding the k nodes with 
highest personalized PageRank values (or personalized au- 
thority scores). We show that we can use the same building 
blocks used for global PageRank and SALSA, that is, the 
stored walk segments at each node, to very efficiently find 
very accurate approximations for the top k nodes. We prove 
that, assuming that the personalized scores follow a power-law
with exponent $0 < \alpha < 1$, if we cache $R > q \ln n$ random
walk segments starting at every node (for large enough constant
$q$), then the expected number of calls made to the
distributed social network database is $O(k / R^{(1-\alpha)/\alpha})$. This is
significantly better than $n$ and even $k$. Notice that without
the power-law assumption, in general, one would have
to find the scores of all the nodes and then find the top $k$
results (hence, one would need at least $\Omega(n)$ work).

We present the results of the experiments we did to val- 
idate our assumptions and analyses. We used the network 
data from the social networking site Twitter. The access to
this data was through a database, called FlockDB, stored in
distributed shared memory. Our experiments support our 
random order assumption on edge arrivals (or at least the 
specific claim that we need for our results to go through). 
Also, we observe that not only do the global PageRank 
scores and in-degrees follow the same power-laws (as previously
proved under mild assumptions in the literature [33]),
but also the personalized PageRanks follow power-laws with 
average exponent roughly equal to the exponent for PageR- 
ank and in-degree. Finally, our experiments also support 
the proved theoretical bounds on the number of calls to the 
social network database. 

Random walk based methods have been reported to be
very effective for the link prediction problem on social networks
[31]. We also did some preliminary experiments to explore
this further. The results are presented in Appendix A
and indicate that random walk based algorithms (i.e., personalized
PageRank and personalized SALSA) significantly
outperform HITS as a recommendation system for Twitter
users; we present this comparison not as significant original 
research but as an interesting data point for readers who are 
interested in practical aspects of recommendation systems. 

1.3 Related Work 

Any PageRank computation or approximation method on
social networks should ideally have the following properties:

1. Ability to keep the values (or approximations) updated 
all the time as the network evolves 

2. Large scale full personalization capability 

3. Very high computational efficiency 

Also, as briefly mentioned above, in social networking ap- 
plications, the data access model is dictated by the need for 
random access to the network, and implemented using a So- 
cial Store. Therefore, a desirable PageRank computation (or 
approximation) scheme should achieve the above mentioned 
features in this data access model. 

A simple way to keep the PageRank values updated is to 
just recompute the values for each incremental change in 
the network. But, this can be very costly. For instance, 
the simple power iteration method [39] to approximate (to
a constant precision) PageRank values (with reset probability
$\epsilon$) takes $\Omega(x / \ln(1/(1 - \epsilon)))$ time over a graph with $x$ edges.
Hence, over $m$ edge arrivals, this takes
$\sum_{x=1}^{m} \Omega(x / \ln(1/(1 - \epsilon))) = \Omega(m^2 / \ln(1/(1 - \epsilon)))$
total time, which is many orders of magnitude
larger than our approach. Similarly, the $\Omega(n/\epsilon)$ time
complexity of the Monte Carlo method results in a total
$\Omega(mn/\epsilon)$ work over $m$ edge arrivals, which is also very inefficient.
So, we need more efficient ways of doing the computations.

Many methods have been proposed for computation
or approximation of PageRank and similar measures [5].
Broadly speaking, these methods fall into two
general categories, based on the core techniques they use:

1. Linear algebraic methods: These methods mainly use
techniques from linear and matrix algebra, perhaps 
with some application of structural properties of the 
networks of interest (e.g., the world wide web) [9, 16,
19-22, 24, 25, 27-29, 35, 41, 43].



2. Monte Carlo methods: These methods use a small 
number of simulated random walks per node to approximate
PageRanks (or other variants) [3, 11, 13, 41].

An excellent survey of many of the methods in the first category
is given by Langville and Meyer [26]. However, for
completeness and also to compare the state of the art with 
our own results, we provide an overview of the methods and 
results in this category here. 

A family of the methods proposed in this category deals
with accelerating the basic power iteration method for computing
PageRank values [20-22]. However, they all provide
only very modest (i.e., small constant factor) speedups. For
instance, Kamvar et al. [22] propose a method to accelerate
the power iteration, using an extrapolation based on the Aitken
method for accelerating linearly convergent sequences.
However, as discussed in their paper, the time complexity
of their method is $\Omega(n)$, which is prohibitively large for a
real-time application. Also, their experiments show only a
25-300% speedup compared to the crude power iteration.
So, the method does not perform well enough for our applications.

Another family of the methods in the first category deals
with efficiently updating the PageRank values using the "aggregation"
idea [9, 24, 25, 27-29]. The basic idea behind these
methods is that when an incremental change happens in the
network, the effect of this change on the PageRank vector is
mostly local. That is, only the PageRanks of the nodes in
the vicinity of the change may change significantly. To utilize
this observation, these methods partition the set of the
network nodes into two subsets $G$ and $\bar{G}$, where $G$ is a subset of
nodes close to where the incremental change happened, and
$\bar{G}$ is the set of all other nodes. Then, all the nodes in $\bar{G}$ are
lumped/aggregated into a single hyper-node, so a smaller
network (composed of $G$ and this hyper-node) is formed.
Then, the PageRanks of the nodes in $G$ are updated using
this network, and finally the result is translated back to the
original network.

None of these methods seem well suited for real time ap- 
plications. First, the performance of these methods heavily 
depends on the partitioning of the network, and as pointed 
out in [27], a bad choice of this partitioning can cause these 
methods to be as slow as the power iteration. It is not known 
how to do this partitioning; while a number of heuristic 
ideas have been proposed [9, 25], there is also considerable
evidence that these networks are expanders, and no such 
partitioning is possible. Further, it is easy to see that in- 
dependent of how the partitioning is done, partitioning and 
aggregation together will need 0(n) time. Also, notice that 
this work is in addition to the actual PageRank computa- 
tion that needs to be done on the aggregated network, and 
this computational load is also not negligible. For instance, 
as reported by Chien et al. [9], for a single edge addition
to a network with 60M nodes, they need to do a PageRank 
computation on a network with almost SK nodes. After 
all, these methods start with a precise PageRank vector and 
give an approximate PageRank vector for the network after 
the incremental change. Therefore, even if these methods 
were run for a real-time application, the approximation er- 
ror would potentially accumulate, and the estimations would 
drift away from the true PageRanks [5]. Of course, we
should mention that there exist exact aggregation based update
methods, but all of those methods are more costly than
power iteration [25].



A number of other methods in the first category also deal
with updating PageRanks [35, 43]. However, the method
in [43] does not scale well for large scale social networking
applications. It achieves $O(l^2)$ update time for random
walk based methods on an $n \times l$ bipartite graph. This may
work well when the graph is very skewed (i.e., $l \ll n$).
But, for instance, in the friend recommendation application
on social networks, $l = n$, so this gives only $O(n^2)$ update
time, and hence $O(mn^2)$ total time over $m$ edge arrivals,
which is prohibitively slow. Also, McSherry [35] combines a number
of heuristics to provide some improvement in computation
and updating of PageRank, using a sequential implementation
of the power iteration. The method works in the
streaming model where edges are stored and then streamed
from the disk. No guarantee is given about the tradeoff
between precision and the time complexity of the method.

Another family of methods in the first category [16, 19, 21]
deals with personalization. In this family, Haveliwala's work
[16] achieves personalization only to the level of a few (e.g.,
16) topics and provides no efficiency improvement over the
power iteration.

Kamvar et al. [21] use the host-induced block structure of 
the web link graph to speed up computation and updating of 
PageRank and also provide personalization. However, first, 
in social networks, there is no equivalent of a web host, and 
more generally it is not easy to find a corresponding block 
structure (even if such a structure actually exists). There- 
fore, it is not even clear how to apply this idea to social 
networking applications. Also, they use the block structure 
only to come up with a better initialization for a standard 
PageRank computation, such as power iteration. There- 
fore, even though, after an incremental change, the local 
PageRank vectors (corresponding to unchanged hosts) may 
be reused, doing the power iteration alone would need $\Omega(m)$
work per change (and hence a total $\Omega(m^2)$ work over
m edge arrivals). Finally, on the personalization front, their 
method achieves personalization only at the host level. 

Jeh and Widom [19] achieve personalization only over a
subset of the network nodes, denoted as the "Hub Set".
Even though the paper provides no precise time complexity
analysis, it is easy to see that, as mentioned in the paper,
the time performance of the presented algorithm heavily depends
on the choice of the hub set. In our application where
we wish to have full personalization, the hub set needs to be
simply the set of all vertices, in which case the algorithms
in [19] reduce to a simple dynamic programming which provides
no performance improvement over power iteration.

Another notable work in the first category is [41]. It uses
deterministic rounding or randomized sketching techniques
along with the dynamic programming approach proposed
in [19] to achieve full personalization. However, the time
complexity of their (rounding) method, to achieve constant
error, is $O(m/\epsilon)$, while the time complexity of the simple
Monte Carlo method to achieve constant error with high
probability is just $O(n \ln n / \epsilon)$. Therefore, if $m = \omega(n \ln n)$,
which is expected in our applications of interest, then the
simple Monte Carlo method is asymptotically faster than the
method introduced in [41]. Also, it is not clear how one
would be able to efficiently update this method's estimates
as the network changes.

The methods in the second category above, namely Monte
Carlo methods, have the advantage that they are very efficient
and can achieve full personalization on a large scale
[3, 13]. However, all the literature in this category deals
with static networks. Of course, it has been mentioned in [3]
that one can keep the approximations updated continuously.
However, they neither provide any details of how exactly to
do this nor give any analysis of the efficiency of doing
these updates. For instance, the method in [3] uses $\Omega(n/\epsilon)$
work in each computation. So, if we naively recompute the
PageRank using this method for each edge arrival, then over
$m$ edge arrivals, we will have $\Omega(mn/\epsilon)$ total work.

In contrast, in this paper, we show that one can use the 
Monte Carlo techniques to achieve very cheap incremental 
updates. Indeed, we prove a surprising result, stating that 
up to a logarithmic factor, the total work required to keep 
the approximations updated all the time is the same as the 
work needed to just initialize the approximations! More 
precisely, we prove that over m edge arrivals in a random 
order, we can keep the approximations updated using only 
$O(n \ln m / \epsilon^2)$ total work. This is significantly better than all the
previously proposed methods for doing the updates. 

Another issue with the methods in the second category is 
that if we want to directly use the simulated random walk 
segments to approximate personalized PageRank, we would 
get a limited precision. For instance, if we store R random 
walks per node (and use only the end points of these walks 
for our estimates) the approximate personalized PageRank 
vectors that we get would have at most R non-zero entries, 
which is significantly fewer than what we need in all applications
of interest; previous works [11, 13] do not explore this
tradeoff in any detail. 

In this paper, we present a formal analysis of this tradeoff
in the random access model, and prove that under the
power-law model for the network, one can do long enough
walks (and hence get desirable levels of precision) very efficiently.
Indeed, Das Sarma et al. [11] achieve time performance
$O(m/\sqrt{\epsilon})$, and hence using their method for each
incremental change in the network would need $O(m^2/\sqrt{\epsilon})$
total work over $m$ edge arrivals. While this is better than
the naive power iteration $O(m^2 / \ln(1/(1 - \epsilon)))$ time complexity,
it is still very inefficient. But, we show that in our model,
we can achieve an $O(n \ln m / \epsilon^2)$ time complexity, assuming a
power-law network model, which is significantly better, and
good enough for real-time incremental applications.

Collaborative filtering on social networks is very closely
related to the Link Prediction problem [1, 12, 17, 18, 31, 42],
for which the random walk based methods have been reported
to work very well [31] (we also verified this experimentally;
the results are given in Appendix A). Most of
the literature on Link Prediction deals with static networks.
But, there are some proposed algorithms which deal with
dynamically evolving networks [1, 43]. However, none of these
methods scale to today's very large scale social networks.

We would also like to mention that there are incremental
collaborative filtering methods based on low-rank SVD updates.
For instance, refer to [7] and references therein. However,
these methods also do not scale very well. For instance,
the method proposed in [7] requires a total $O(pqr)$ work to
calculate the rank-$r$ SVD of a $p \times q$ matrix. But, for instance,
for the friend recommendation application, $p = q = n$, and
hence this method needs a total $\Omega(n^2)$ work, while we only
need a total $O(n \ln m / \epsilon^2)$ work, as mentioned above, which is
significantly better.



2. INCREMENTAL COMPUTATION OF 
PAGERANK 

In this section, we explain the details of the Monte Carlo 
method for approximating the (global) PageRank values. 
Then, we will prove our surprising result on the total amount 
of work needed to keep the PageRank estimates updated all 
the time. Even though we will focus on PageRank, our 
results will extend (with some minor modifications that will 
be pointed out) to other random walk based methods, such 
as SALSA. 

2.1 Approximating PageRank 

To approximate PageRank, we do $R$ random walks starting
at each node of the network. Each of these random walks
is continued until its first reset (hence, each one has average
length $1/\epsilon$). We store all these walk segments in a database,
where each segment is stored at every node that it passes
through. Assume that for each node $v$, $X_v$ is the total number
of times that any of the stored walk segments visits $v$. Then,
we approximate the PageRank of $v$, denoted by $\pi_v$, with:

$$\tilde{\pi}_v = \frac{X_v}{nR/\epsilon}$$

Then, we have the following theorem:

Theorem 1. $\tilde{\pi}_v$ is sharply concentrated around its expectation,
which is $\pi_v$.

This theorem is proved in Appendix B. The exact concentration
bounds follow from the proof. But, to summarize,
the obtained approximations are quite good even for $R = 1$.
Thus, the defined $\tilde{\pi}_v$'s are accurate approximations of the
actual PageRank values $\pi_v$.
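
The estimator of this subsection can be sketched as follows. This is a minimal in-memory illustration, not the distributed implementation: the names walk_segment and monte_carlo_pagerank, the seeded generator, and the convention that a walk simply stops at a node without outgoing edges are all assumptions made for the example.

```python
import random


def walk_segment(out_edges, start, eps, rng):
    """One walk from `start`, continued until its first reset.

    Assumption for this sketch: a walk also stops at a node that has
    no outgoing edges.
    """
    path = [start]
    node = start
    while rng.random() >= eps and out_edges[node]:
        node = rng.choice(out_edges[node])
        path.append(node)
    return path


def monte_carlo_pagerank(out_edges, R=10, eps=0.2, seed=0):
    """Estimate PageRank as X_v / (nR/eps) from R stored walks per node."""
    rng = random.Random(seed)
    n = len(out_edges)
    visits = {v: 0 for v in out_edges}          # X_v
    segments = []
    for v in out_edges:
        for _ in range(R):
            seg = walk_segment(out_edges, v, eps, rng)
            segments.append(seg)
            for u in seg:
                visits[u] += 1
    scale = n * R / eps                         # expected total number of visits
    return {v: visits[v] / scale for v in out_edges}, segments


graph = {0: [1, 2], 1: [2], 2: [0]}
estimates, segments = monte_carlo_pagerank(graph, R=200, eps=0.2, seed=0)
```

The stored segments are returned alongside the estimates because the incremental updates of the next subsection operate on the segments rather than on the scores.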

2.2 Updating the Approximations 

As the underlying network evolves, the PageRank values 
of the nodes also change. For a lot of applications, we would 
like to have updated approximations of the PageRank values 
all the time. Thus, here we analyze the cost of keeping the 
approximations updated at all times. To keep the approx- 
imations updated, we only need to keep the random walk 
segments stored at the network nodes updated. Thus, we 
analyze the amount of work in keeping these walks updated. 

We first prove the following proposition: 

Proposition 2. Assume $(u_t, v_t)$ is the random edge arriving
at time $t$ ($1 \le t \le m$). Define $M_t$ to be the number
of random walk segments that need to be updated at time $t$
($1 \le t \le m$). Then, we have:

$$E[M_t] \le \frac{nR}{\epsilon}\, E\left[\frac{\pi_{u_t}}{\mathrm{outdeg}_{u_t}(t)}\right]$$

where the expectation on the right hand side is over the edge
arrival at time $t$ (i.e., $E[\cdot] = E_{(u_t, v_t)}[\cdot]$), and $\mathrm{outdeg}_u(t)$ is
the outdegree of node $u$ after $t$ edges have arrived.

Proof. The basic intuition in this proposition is that
most random walk segments miss most network edges. More
precisely, a walk segment needs to change only if it passes
through the node $u_t$ and the random step at $u_t$ picks $v_t$
as the next node in the walk. In expectation, the number
of times each walk segment visits $u_t$ is $\pi_{u_t}/\epsilon$. For each
such visit, the probability for the walk to need a reroute is
$1/\mathrm{outdeg}_{u_t}(t)$. Hence, by a union bound, the probability that
a walk segment needs an update is at most $\frac{\pi_{u_t}}{\epsilon\, \mathrm{outdeg}_{u_t}(t)}$.
Also, there is a total of $nR$ walk segments. Therefore, by
linearity of expectation:

$$E[M_t] \le nR\, E\left[\frac{\pi_{u_t}}{\epsilon\, \mathrm{outdeg}_{u_t}(t)}\right] = \frac{nR}{\epsilon}\, E\left[\frac{\pi_{u_t}}{\mathrm{outdeg}_{u_t}(t)}\right]$$

which proves the proposition. □

In the above proposition, $E[\pi_{u_t}/\mathrm{outdeg}_{u_t}(t)]$ depends on the
exact network growth model. The model that we assume
here is the random permutation model, in which $m$ adversarially
chosen directed edges arrive in random order. Notice
that this is a weaker assumption on the network evolution
model than most of the popular models, such as the Preferential
Attachment model [3]. We also experimentally validate
this model, and present the results, confirming that
this is actually a very good assumption, later in this paper.
For this model, we have the following lemma:

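Lemma 3. If $(u_t, v_t)$ is the edge arriving at time $t$ ($1 \le t \le m$)
in a random permutation over the edge set, then:

$$E\left[\frac{\pi_{u_t}}{\mathrm{outdeg}_{u_t}(t)}\right] = \frac{1}{t}$$

Proof. For random arrivals, we have:

$$\Pr[u_t = u] = \frac{\mathrm{outdeg}_u(t)}{t}$$

Hence,

$$E\left[\frac{\pi_{u_t}}{\mathrm{outdeg}_{u_t}(t)}\right] = \sum_u \frac{\pi_u}{\mathrm{outdeg}_u(t)} \Pr[u_t = u] = \sum_u \frac{\pi_u}{\mathrm{outdeg}_u(t)} \cdot \frac{\mathrm{outdeg}_u(t)}{t} = \frac{\sum_u \pi_u}{t} = \frac{1}{t}$$

□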
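
The cancellation used in this proof can be checked exactly on a toy instance. The sketch below assumes the newest edge is uniform over the $t$ edges present (the random permutation model) and that every node has at least one outgoing edge; the function name expected_ratio and the particular distribution pi are illustrative choices, and any probability vector over the nodes would do.

```python
from collections import Counter
from fractions import Fraction


def expected_ratio(edges, pi):
    """Exact E[pi_u / outdeg_u] when the newest edge (u, v) is uniform over `edges`."""
    t = len(edges)
    outdeg = Counter(u for u, _ in edges)
    # Pr[u_t = u] = outdeg_u / t, so summing over edges weights each source node
    # by its outdegree, which cancels the division by outdeg_u.
    return sum(pi[u] / outdeg[u] for u, _ in edges) / t


edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
pi = {0: Fraction(1, 2), 1: Fraction(1, 6), 2: Fraction(1, 3)}
value = expected_ratio(edges, pi)   # equals 1/t with t = 4
```

With four edges the result is exactly 1/4, independent of how the mass of pi is spread over the nodes.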



From Lemma 3 and Proposition 2 we get the following
theorem:

Theorem 4. The expected amount of update work as the
$t$-th network edge arrives is at most $nR/(t\epsilon^2)$, and the expected
total amount of work needed to keep the approximations updated
over $m$ edge arrivals is at most $\frac{nR}{\epsilon^2} \ln m$.

Proof. Defining $M_t$ as in Proposition 2, we know from
the same proposition that

$$E[M_t] \le \frac{nR}{\epsilon}\, E\left[\frac{\pi_{u_t}}{\mathrm{outdeg}_{u_t}(t)}\right]$$

Also, from Lemma 3, we know:

$$E\left[\frac{\pi_{u_t}}{\mathrm{outdeg}_{u_t}(t)}\right] = \frac{1}{t}$$

Hence,

$$E[M_t] \le \frac{nR}{\epsilon} \cdot \frac{1}{t}$$

For each walk segment that needs an update, we can redo
the walk starting at the updated node, or even more simply
starting at the corresponding source node. So, for each such
walk segment, on average we need at most $1/\epsilon$ work (equal to
the average length of the walk segment). Hence, the average
work needed at time $t$ is at most $nR/(t\epsilon^2)$, as stated in the
theorem.

Summing up over all time instances, we get that the expected
total amount of work that we need to do over $m$ edge
arrivals (to keep the approximations updated all the time)
is:

$$\sum_{t=1}^{m} \frac{nR}{\epsilon^2} \cdot \frac{1}{t} = \frac{nR}{\epsilon^2} \sum_{t=1}^{m} \frac{1}{t} = \frac{nR}{\epsilon^2} H_m$$

where $H_m$ is the $m$-th harmonic number. Therefore, the total
amount of expected work required is at most $\frac{nR}{\epsilon^2} \ln m$, which
finishes the proof. □
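
To get a feel for the magnitudes, the per-arrival bounds $nR/(t\epsilon^2)$ can simply be summed numerically. The parameter values below are arbitrary and chosen only for illustration, and the function name update_work_bound is hypothetical. Since $H_m \le 1 + \ln m$, the sum stays within a small additive term of the $\frac{nR}{\epsilon^2} \ln m$ figure.

```python
import math


def update_work_bound(n, R, eps, m):
    """Sum of the per-arrival bounds nR/(t*eps^2) for t = 1..m."""
    harmonic = sum(1.0 / t for t in range(1, m + 1))    # H_m
    return n * R / eps**2 * harmonic


# Illustrative parameters only; they are not taken from the experiments.
n, R, eps, m = 1000, 1, 0.2, 10000
incremental = update_work_bound(n, R, eps, m)
initial = n * R / eps                    # generating all walk segments once
naive_monte_carlo = m * n * R / eps      # regenerating them on every arrival
```

For these values the incremental bound is a few hundred thousand steps, while regenerating all walks on every arrival costs fifty million.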

The above theorem bounds the amount of update work as 
new edges arrive. Similarly, we can show that we can very 
efficiently handle edges leaving the graph: 

Proposition 5. When the network has $m$ edges, if a randomly
chosen edge leaves the graph, then the expected amount
of work necessary to update the walk segments is at most
$nR/(m\epsilon^2)$.

Proof. If $M$ is the number of walk segments that need
to be updated, and $(u^*, v^*)$ is the random edge leaving the
network, then exactly as in Proposition 2, one can see that:

$$E[M] \le \frac{nR}{\epsilon}\, E\left[\frac{\pi_{u^*}}{\mathrm{outdeg}_{u^*}}\right]$$

Also, exactly as in Lemma 3, one can see that
$E\left[\frac{\pi_{u^*}}{\mathrm{outdeg}_{u^*}}\right] = 1/m$. Finally, as in Theorem 4, the
proof is completed by noticing that for each walk segment
needing an update, the expected amount of work is $1/\epsilon$. □
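
One possible way to repair the stored segments after a deletion is sketched below. The text above only bounds the cost, so the details here are assumptions: the names continue_walk and remove_edge are hypothetical, the graph is assumed to have no parallel edges, all segments are scanned linearly rather than looked up per node, and a segment that used the deleted edge is regenerated from the point where it left $u^*$.

```python
import random


def continue_walk(out_edges, path, eps, rng, must_step=False):
    """Extend `path` until the first reset; `must_step` skips the first reset check."""
    node = path[-1]
    while out_edges.get(node) and (must_step or rng.random() >= eps):
        must_step = False
        node = rng.choice(out_edges[node])
        path.append(node)
    return path


def remove_edge(out_edges, segments, u, v, eps, rng):
    """Delete edge (u, v) and repair every stored segment that traversed it."""
    out_edges[u].remove(v)
    changed = 0
    for i, seg in enumerate(segments):
        for pos in range(len(seg) - 1):
            if seg[pos] == u and seg[pos + 1] == v:
                # The walk had already decided to move on from u, so it takes
                # a fresh step along a remaining edge (if any) and continues.
                prefix = seg[: pos + 1]
                segments[i] = continue_walk(out_edges, prefix, eps, rng, must_step=True)
                changed += 1
                break
    return changed
```

Only the first traversal of the deleted edge in a segment needs to be found, because everything after it is regenerated on the updated graph.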

The result in Theorem 4 (and similarly Proposition 5) is
quite surprising (at least to the authors). Hence, we will now
discuss various aspects of it. First, notice that the amount
of work needed to keep the approximations updated all the
time is only logarithmically larger than the cost to initialize
the approximations (i.e., $nR/\epsilon$). Also, it is clear that the
marginal update cost for not-so-early edges (say edges after
the first $\Omega(n)$ ones) is so small that we can do the updates
in real time even per social interaction (e.g., clicks).

Also, notice that in applications in networks such as the
web graph, we cannot do the real time updates. That is
because there is no way to figure out the changes in the 
network, other than recrawling the network which is not 
feasible in real time. Also, random access to edges in the 
network is expensive. However, in social networking appli- 
cations, the network and all the events happening on it are 
always available and visible to the network provider. There- 
fore, it is indeed possible to do the real time updates, and 
hence this method is very well suited for social networking 
applications. 

It should be mentioned that the update cost analyzed 
above is the extra cost due to updating the PageRank ap- 
proximations. In other words, as a new edge is added to 
the network it should be added to the database containing 



the network. We can keep the random walk segments in
another database, say the PageRank Store. For each node $v$,
we also keep two counters: one, denoted by $W(v)$, keeping
track of the number of walk segments visiting $v$, and one,
denoted by $d(v)$, keeping track of the outdegree of $v$. Then,
when a new edge arrives at node $v$, first we add it to the
Social Store. Then, with probability $1 - (1 - 1/d(v))^{W(v)}$
we call the PageRank Store to do the updates, in which case
the PageRank Store will incur an additional cost as analyzed
above in Theorem 4. We are assuming that the preprocessing
(to generate the random number) can be done for free,
which is reasonable, as it does not require any extra network
transmissions or disk accesses; without this assumption, the
running time would be $O(m + nR\ln m/\epsilon^2)$, which is still much
better than the existing results.
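
As a rough illustration of the gating step just described, the following Python sketch computes the probability $1 - (1 - 1/d(v))^{W(v)}$ and uses it to decide whether the PageRank Store needs to be called at all. The function names, and the convention that `out_degree` already counts the newly added edge, are assumptions made for this sketch; they are not taken from the system described in this paper.

```python
import random


def update_probability(walk_visits: int, out_degree: int) -> float:
    """Probability that at least one of the W(v) stored visits to v is rerouted.

    Assumption of this sketch: out_degree is d(v) after the new edge was added,
    and each visit is rerouted independently with probability 1/d(v).
    """
    if walk_visits <= 0 or out_degree <= 0:
        return 0.0
    return 1.0 - (1.0 - 1.0 / out_degree) ** walk_visits


def should_call_pagerank_store(walk_visits: int, out_degree: int,
                               rng: random.Random) -> bool:
    """Flip one biased coin locally; only on success is the store contacted."""
    return rng.random() < update_probability(walk_visits, out_degree)


if __name__ == "__main__":
    rng = random.Random(0)
    # A node visited by 3 walk segments whose outdegree just became 50.
    print(update_probability(3, 50))
    print(should_call_pagerank_store(3, 50, rng))
```

The point of the design is that the coin flip needs only the two counters, so most edge arrivals never touch the stored walk segments.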

In theorem 4, we analyzed the update costs under the
random permutation model. Another model of interest is
the Dirichlet model, in which
$\Pr[u_t = u] = [d_u(t-1) + 1] / [t - 1 + n]$, where $d_u(t-1)$ is the
outdegree of $u$ after $t-1$ edge arrivals. Following the same
proof steps as in theorem 4, we can again prove that the total
expected update cost over $m$ edge arrivals in this model is
$\frac{nR}{\epsilon^2}\ln\left(\frac{m+n}{n}\right)$. Again, the
total update cost grows only logarithmically, and the
marginal update costs are small enough to allow real time
updates. This raises the question of whether the same
result holds in the adversarial arrival model. Interestingly,
the answer to this question is negative. We show this with
an example.

Example 1. Consider a network formed by a directed $N$-cycle
$v_1, v_2, \ldots, v_N$, a node $u$, $N$ nodes $x_1, x_2, \ldots, x_N$, and
$N$ nodes $y_1, y_2, \ldots, y_N$ (hence, the total number of nodes in
the network is $n = 3N + 1$). Assume every $v_j$ ($1 \le j \le N$)
has an edge to $u$; $u$ has an edge to every $x_j$ ($1 \le j \le N$);
every $x_j$ ($1 \le j \le N$) has an edge to $u$; $v_1$ has an edge to
every $y_j$ ($1 \le j \le N$), and every $y_j$ ($1 \le j \le N$) has an
edge to $v_1$. Then, in this network, adding just one edge from
$u$ to $v_1$ will force $\Omega(n)$ random walk segments to need to be
updated. So, it is not true that in an adversarial edge arrival
model, the amount of work needed at each time instance vanishes
over time. In other words, the above-presented results
indeed use the randomness in the arrival order of the edges.

2.3 Extension to SALSA 

To approximate the hub and authority scores in SALSA, 
we need to keep 2R random walk segments per node; R 
random walks starting with a forward step from the seed 
node, and R walks starting with a backward step. Then, the 
approximations are done similar to the case of PageRank, 
and the sharp concentration of the approximations can be 
proved in a similar way. 

For the update cost, we notice that if $(u_t, v_t)$ is the edge
arriving at time $t$, then, unlike the PageRank case where
only $u_t$ could cause updates, both $u_t$ and $v_t$ can cause walk
segments to need an update. Again, we assume the random
permutation model for edge arrivals. Then:

$$\Pr[v_t = v] = \frac{\mathrm{indeg}_v(t)}{t}$$

and we get the following theorem:



Theorem 6. The expected amount of work needed to keep
the approximations updated over $m$ edge arrivals is at most
$\frac{16nR}{\epsilon^2}\ln m$.
Rather than presenting the complete proof, which follows
exactly the same steps as the one for theorem 4, we just
explain where the extra factor of 16 comes from. Instead
of $R$ walks we are storing $2R$ walks per node, introducing
a factor of 2. Rather than $1/\epsilon$, each walk segment has
average length $2/\epsilon$ (because we only allow resets at forward
steps), introducing a factor of 4 (as $\epsilon$ appears in the bound
as $\epsilon^2$). Also, each time an edge $(u_t, v_t)$ arrives, both $u_t$ and
$v_t$ can cause updates, hence twice as many walks need to be
updated at each time. Together, these three modifications
cause a factor of 16 difference.

3. APPROXIMATING PERSONALIZED 
PAGERANK AND SALSA 

In the previous section, we showed how we can approx- 
imate PageRank and SALSA by keeping a small number 
of random walk segments per each node. In this section, 
we show how we can reuse the same stored walk segments 
to also approximate the personalized variants of PageRank 
and SALSA. Again, we will focus the discussion on (person- 
alized) PageRank. All the results also extend to the case of 
SALSA. 

The idea of using simulated random walk segments to approximate
personalized PageRank was introduced in [13].
However, in that paper, the personalized PageRank values 
are simply approximated by the frequency of node visits 
in the walk segments. This limits the precision of the ap- 
proximations. To improve the precision, the paper proposes 
to stitch the walk segments together to form longer walks. 
However, they provide neither any details of how to do this 
nor any analysis of the trade off between the precision and 
the query time. 

Das Sarma et al. [11] propose a method to stitch the walk 
segments to form longer walks and analyze its performance 
in the streaming model in which the graph is accessed by 
streaming from disk. In this paper, we use an almost identi- 
cal algorithm to do the personalized PageRank approxima- 
tions. However, since we work in the random access model, 
we will need a different analysis of the algorithm. 

We start by explaining the algorithm. Here, the basic idea
is that we will perform the personalized PageRank random
walk, but opportunistically use the $R$ stored random walk
segments (described in section 2) for each node, where
possible. To access these walk segments, we have to query the
database containing them. A query to this database for a
node $u$ returns all $R$ walk segments starting at $u$ as well as
all the neighbors of $u$. We call such a query a "fetch"
operation. Then, taking a random walk starting from a source
node $w$, based on the stored walk segments, can be done as
presented in Algorithm 1.

The main cost in this algorithm is the fetch operations it 
does. Everything else is done in main memory which is very 
fast. Thus, we would like to analyze the number of fetches 
made by this algorithm. 
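
To make the role of the fetch operation concrete, here is a minimal Python sketch of the stitching walk that also counts fetches. It is an illustration under stated assumptions, not the implementation used in our experiments: the callback `fetch(u)` is hypothetical and is assumed to return the stored segments starting at `u` (each given as the list of nodes visited after `u`) together with the out-neighbors of `u`; a node with no outgoing edges is assumed to send the walk back to the source.

```python
import random


def personalized_walk(source, length, eps, fetch, rng):
    """Stitch a personalized walk of at least `length` nodes; count fetches.

    fetch(u) -> (segments, neighbors) is a hypothetical database call.
    """
    walk = [source]
    segments = {}   # node -> unused stored segments held in memory
    neighbors = {}  # node -> out-neighbors held in memory
    fetches = 0
    while len(walk) < length:
        u = walk[-1]
        if rng.random() < eps:
            walk.append(source)                    # reset to the source
        elif segments.get(u):
            walk.extend(segments[u].pop())         # reuse one stored segment
            walk.append(source)                    # a segment ends at a reset
        elif u in neighbors:
            if neighbors[u]:
                walk.append(rng.choice(neighbors[u]))  # one manual step
            else:
                walk.append(source)                # assumption: dangling node
        else:
            segs, nbrs = fetch(u)                  # the only expensive step
            segments[u] = [list(s) for s in segs]
            neighbors[u] = list(nbrs)
            fetches += 1
    return walk, fetches


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [2], 2: [0]}
    stored = {0: [[1, 2]], 1: [[2]], 2: [[0, 1]]}
    walk, fetches = personalized_walk(
        0, 30, 0.2, lambda u: (stored[u], graph[u]), random.Random(1))
    print(len(walk), fetches)
```

In this sketch every node is fetched at most once, so the fetch counter is exactly the quantity analyzed in the rest of this section.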

But, unlike the case of global PageRank and SALSA, we 
notice that in applications of personalized PageRank and 
SALSA, what we are interested in is the nodes with the 
largest values of the personalized score. For instance, in a 
recommendation system based on personalized SALSA or 



Algorithm 1 Personalized PageRank Walk Using Walk Segments

Input: Source node $w$, required length $L$ of the walk
Output: A personalized PageRank walk $P_w$ for source
node $w$ of length at least $L$

Start the walk at $w$: $P_w \leftarrow [w]$
while length($P_w$) < $L$ do
    $u \leftarrow$ last node in $P_w$
    Generate a uniformly random number $\beta \in [0, 1]$
    if $\beta < \epsilon$ then
        Reset the walk to $w$: $P_w \leftarrow P_w$.append($w$)
    else
        if $u$ has an unused walk segment $Q$ remaining in memory then
            Add $Q$ to the end of $P_w$: $P_w \leftarrow P_w$.append($Q$)
            Then, reset the walk to $w$: $P_w \leftarrow P_w$.append($w$)
        else
            if $u$ was previously fetched then
                Take a random edge $(u, v)$ out of $u$
                Add $v$ to the end of $P_w$: $P_w \leftarrow P_w$.append($v$)
            else
                Do a fetch at $u$
            end if
        end if
    end if
end while



personalized PageRank, we are only interested in the nodes 
with the largest authority scores, because the system is even- 
tually going to find and recommend only those nodes any- 
way. Thus, our objective here is to find the k nodes (for 
some suitably chosen k) with the largest personalized scores. 
We show that, under a power-law network model, the above 
algorithm does this very efficiently. 

We start by describing our network model precisely.

3.1 Network Model 

If $\pi$ is the vector of the scores of interest (e.g., personalized
PageRanks), we assume that it follows a power-law
model. That is, if $\pi_j$ is the $j$-th largest entry in $\pi$, then we
assume:

$$\pi_j \propto j^{-\alpha} \qquad (2)$$

for some $0 < \alpha < 1$. This is a very well-motivated model.
First, it has been proved [33] that (under some assumptions),
in a network with power-law indegrees, the PageRank vector
also follows a power-law, which has the same exponent
as the one for indegrees. Our experiments with the Twitter
social network, whose results are presented in section 4,
not only confirm this result, but also show that personalized
PageRank vectors also follow power-laws, and that the
average exponent for the personalized PageRank vectors is
roughly the same as (yes, you guessed it!) the one for indegree
and global PageRank.

By approximating summation with integration, we can
approximately calculate the normalizing factor in equation
2, denoted by $\eta$:

$$1 = \sum_{j=1}^{n} \pi_j = \eta \sum_{j=1}^{n} j^{-\alpha} \approx \eta \int_0^n x^{-\alpha}\,dx = \eta\,\frac{n^{1-\alpha}}{1-\alpha}$$

Hence $\eta = (1-\alpha)/n^{1-\alpha}$, and:

$$\pi_j = \frac{(1-\alpha)\, j^{-\alpha}}{n^{1-\alpha}} \qquad (3)$$

So, we will assume that the values of the $\pi_j$'s are given by
equation 3 (i.e., we ignore the very small error in estimating
the summation with integration). This completes the
description of our model.
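
As a quick sanity check of this approximation, the following Python sketch builds the scores $(1-\alpha) j^{-\alpha} / n^{1-\alpha}$ and sums them. The values of `n` and `alpha` are arbitrary choices for illustration. Because the integral from 0 slightly overestimates the sum, the total comes out a little below 1, and the gap shrinks as `n` grows.

```python
def power_law_scores(n: int, alpha: float) -> list:
    """Scores (1 - alpha) * j^(-alpha) / n^(1 - alpha) for ranks j = 1..n."""
    eta = (1.0 - alpha) / n ** (1.0 - alpha)
    return [eta * j ** (-alpha) for j in range(1, n + 1)]


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        total = sum(power_law_scores(n, 0.75))
        print(n, round(total, 4))  # approaches 1 from below as n grows
```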

3.2 Approximating the top k nodes 

Fixing a number c, we do a long enough random walk 
(with resets to the seed node) that for each of the top k 
nodes, we expect to see that node at least c times. Then, 
we will return the k nodes most visited in this random walk. 
To this end, we first give a definition and prove a small 
technical lemma. 

Definition 1. $X_{s,v}$ is the number of times that we visit
node $v$ in a random walk of length $s$.

Lemma 7. $\sum_v \left| E[X_{s,v}] - s\pi_v \right| \le 2/\epsilon$

The proof of this lemma is given in Appendix C.
The above lemma shows that if $s\pi_v$ is not very small (e.g.,
compared to $1/\epsilon$), we can approximate $E[X_{s,v}]$ (and because
of sharp concentration of $X_{s,v}$, even $X_{s,v}$ itself) with $s\pi_v$.
This is what we will do in the rest of this section.

Therefore, in order to see each of the top $k$ nodes $c$ times
in expectation, the minimum length of the walk that we
need to take is determined by $s\pi_k = c$, which gives:

$$s_k = \frac{c}{1-\alpha}\, k^{\alpha}\, n^{1-\alpha} \qquad (4)$$

This gives us the length of the walk that we need to do
using our algorithm. So, now we can analyze the algorithm.
We prove the following theorem:

Theorem 8. If we store $R > q\ln n$ walk segments at each
node for a large enough constant $q$, then the expected number
of fetches done to take a random walk of length $s$ is at most:

$$1 + \left(\frac{2(1-\alpha)}{nR}\right)^{\frac{1}{\alpha}-1} s^{1/\alpha}$$

Proof. A fetch is made at node $u$ only if we arrive at $u$
from a parent node $v$ which ran out of unused walk segments.
In other words, each fetch at a node $u$ can be charged to
an extra visit to one of $u$'s parents. Therefore, denoting the
number of fetches made during the algorithm by $F$, we have:

$$F \le \sum_v (X_{s,v} - R)^+$$

Hence,

$$\begin{aligned}
E[F] &\le \sum_v E\left[(X_{s,v} - R)^+\right] \\
&= \sum_v E\left[X_{s,v} - R \mid X_{s,v} > R\right] \Pr(X_{s,v} > R) \\
&\le \sum_{\{v \mid E[X_{s,v}] \le R/2\}} n \Pr(X_{s,v} > R) + \sum_{\{v \mid E[X_{s,v}] > R/2\}} E[X_{s,v}] \\
&\le 1 + \sum_{\{v \mid E[X_{s,v}] > R/2\}} E[X_{s,v}]
\end{aligned}$$



Here the second to last inequality holds because (due
to the memoryless property of the random walk)
$E[X_{s,v} - R \mid X_{s,v} > R] \le E[X_{s,v}]$, and the last inequality holds
because with $R > q\ln n$ for large enough $q$, if $E[X_{s,v}] \le R/2$
then $\Pr(X_{s,v} > R) = o(1/n)$ using Chernoff bounds (and
if $E[X_{s,v}] > R/2$ then $E[X_{s,v}]$ is almost equal to $s\pi_v$, as
mentioned after lemma 7).

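But, $s\pi_v > R/2$ if and only if $v < r$, where $r$ is such that
$s\pi_r = R/2$. This gives:

$$r = \left(\frac{2(1-\alpha)\, s}{R\, n^{1-\alpha}}\right)^{1/\alpha}$$

and $E[F] \le 1 + \sum_{j=1}^{r} s\pi_j$. Upper-bounding the summation
with integration, we get:

$$E[F] \le 1 + \frac{s(1-\alpha)}{n^{1-\alpha}} \int_0^r x^{-\alpha}\,dx = 1 + s\left(\frac{r}{n}\right)^{1-\alpha} = 1 + \left(\frac{2(1-\alpha)}{nR}\right)^{\frac{1}{\alpha}-1} s^{1/\alpha}$$

which finishes the proof. □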

Remark 1. We defined a fetch operation to return all
the stored walk segments as well as all the outgoing edges
of the queried node. While the number of stored walks per
node is small, the outdegree of a node can be very large
(indeed as large as $\Omega(n)$). For instance, in the Twitter social
network, @BarackObama has more than 750,000 outgoing
edges. Thus, a fetch at such nodes may cause memory
problems. However, notice that if we change the fetch
operation for node $w$ to either return all $R$ stored walk
segments starting at $w$ or just one randomly sampled outgoing
edge from $w$, then in the analysis in theorem 8, we will only
get at most a factor 2 more fetches (because we will have
$F \le 2\sum_v (X_{s,v} - R)^+$). So, we will just stick with our
original definition of a fetch operation.

Using the value of $s_k$ from equation 4 in the result of
theorem 8 directly gives the following corollary:

Corollary 9. Storing $R > q\ln n$ walk segments per node
for a large enough constant $q$, the expected number of fetches
needed to find the top $k$ nodes is at most

$$1 + \frac{c^{1/\alpha}}{(1-\alpha)\,(R/2)^{\frac{1}{\alpha}-1}}\, k.$$

It should be noted that the bounds given in theorem 8 and
corollary 9 are, respectively,
$O\left(s^{1/\alpha}/(nR)^{\frac{1}{\alpha}-1}\right)$ and $O\left(k/R^{\frac{1}{\alpha}-1}\right)$.
Also, as our experiments (described later) show, the theoretical
bounds are fairly accurate for values of $R$ as small as
5.

Remark 2. To compare the bounds from equation 4 and
corollary 9, let $\alpha = 0.75$, $c = 5$, $R = 10$, $k = 100$, and
$n = 10^8$. Then, equation 4 bounds the number of required steps
(also equal to the number of database queries, if done in the
crude way) with $632k = 63200$, while corollary 9 bounds the
number of required fetches with the much smaller number
$20k = 2000$. Also, notice how significantly smaller than
$n = 10^8$ both these bounds are.
This is because we are taking advantage of the power-law
assumption/property for the random walks we are interested
in. Without this assumption, in general, even to find the
top $k$ nodes, one would need to calculate all $n$ entries of the
stationary distribution, and then return the top-$k$ values.
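
The two numbers in Remark 2 can be reproduced with a few lines of Python. The function names below are chosen for this sketch only; the formulas are the walk length from equation 4 and the fetch bound from corollary 9.

```python
def walk_length(k: int, c: float, alpha: float, n: int) -> float:
    """Equation 4: s_k = c * k^alpha * n^(1 - alpha) / (1 - alpha)."""
    return c * k ** alpha * n ** (1 - alpha) / (1 - alpha)


def fetch_bound(k: int, c: float, alpha: float, R: int) -> float:
    """Corollary 9: 1 + c^(1/alpha) * k / ((1 - alpha) * (R/2)^(1/alpha - 1))."""
    return 1 + c ** (1 / alpha) * k / ((1 - alpha) * (R / 2) ** (1 / alpha - 1))


if __name__ == "__main__":
    alpha, c, R, k, n = 0.75, 5, 10, 100, 10 ** 8
    print(walk_length(k, c, alpha, n))  # about 632 * k steps
    print(fetch_bound(k, c, alpha, R))  # about 20 * k fetches
```

Note that the fetch bound does not depend on $n$ at all, which is why it stays small even for very large networks.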



4. EXPERIMENTS 

In this section, we present the results of the experiments 
that we did to test our assumptions and methods. 

4.1 Experimental Setup 

We used data from the social networking site, Twitter, 
for our experiments. This network consists of directed edges. 
The access to data was through Twitter's Social Store, called 
FlockDB, stored in distributed shared memory. We emu- 
lated the PageRank Store on top of FlockDB. We used the 
reset probability e = 0.2 in our experiments. For the per- 
sonalized experiments, we picked 100 random users from the 
network, who had a reasonable number of friends (between 
20 and 30). 

4.2 Verification of the Random Permutation 
Assumption 

Given a single permutation (i.e., the one that we actually 
observed on Twitter), it is impossible to validate whether
edges do arrive in random order. However, we can validate
some associated statistics, and as it turns out, these will also 
provide a sufficient precondition for our analysis to hold: 

1. Let $X$ denote the expected value of $\pi_v/\mathrm{outdeg}_v$ for an
arriving edge $(v, w)$. We assumed that $mX = 1$ in our
proof; this is the only assumption that we need in order
for our proof to be applicable. In order to validate this
assumption, we looked at 4.63 million edges arriving
between two snapshots of the social graph (we removed
edges originating from new nodes). The average observed
value of $mX$ for these 4.63 million arrivals was
0.81. This validates the running time we obtained in
this paper (in fact, this is a little better than what
we assumed, since smaller values of $mX$ imply reduced
computation time).

2. While not strictly required for our proof, another interesting
consequence of the random permutation model
is that the probability of an edge arriving out of node
$v$ is proportional to the out-degree of $v$.¹ We will call
this the proportionality assumption. Let $a(d)$ denote
the fraction of newly arriving edges $(v, w)$ such that
$\mathrm{outdeg}_v \le d$. We will refer to $a(d)$ as the arrival
degree cdf (cumulative distribution function). Further,
let $s(d)$ denote the sum of the degrees of all the
nodes which have degree at most $d$, and let $e(d)$ denote
$s(d)/m$. We will refer to $e(d)$ as the existing degree cdf.
If the proportionality assumption is true, we would expect
the two cdfs to nearly coincide. Figure 1 shows
plots of these two cdfs; it is clear that the two cdfs
indeed track each other quite well.
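
The two cdfs compared in the second check can be computed as in the following Python sketch. The input lists are hypothetical: `out_degrees` holds the out-degree of every existing node, and `arrival_out_degrees` holds, for each newly arriving edge $(v, w)$, the out-degree of its source $v$.

```python
def arrival_degree_cdf(arrival_out_degrees, d):
    """a(d): fraction of newly arriving edges whose source has out-degree <= d."""
    return sum(1 for x in arrival_out_degrees if x <= d) / len(arrival_out_degrees)


def existing_degree_cdf(out_degrees, d):
    """e(d) = s(d) / m: share of existing edges leaving nodes of out-degree <= d."""
    m = sum(out_degrees)
    return sum(x for x in out_degrees if x <= d) / m


if __name__ == "__main__":
    out_degrees = [1, 1, 2, 4]          # toy graph with m = 8 edges
    arrival_out_degrees = [1, 2, 4, 4]  # toy sources of four new edges
    for d in (1, 2, 4):
        print(d, arrival_degree_cdf(arrival_out_degrees, d),
              existing_degree_cdf(out_degrees, d))
```

Under the proportionality assumption the two printed columns should be close for every `d`.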

4.3 Network Model Verification 

As we explained in section 3.1, we assumed a power-law
model on the personalized PageRank values. In addition to
considerable literature [33], this assumption was also based
on our experiments, showing the following results:

1. Our network has power-law indegrees and global PageRank,
with exponent roughly 0.76. The results of this
experiment are presented in figure 2.

¹In fact, 1 + the out-degree, but that distinction will not be
empirically important.

Figure 1: Arrival degree and existing degree cumulative
distribution functions

Figure 2: Indegree and PageRank power-laws. Both
axes are log-scale.



2. Personalized PageRank vectors follow power-laws. The
results for the personalized PageRank vectors of 6 random
users are presented in figure 3. Around 2% of the
nodes had $\alpha > 1$. Our analysis is easily adapted to
this case, but we omit the details.

3. As shown in figure 4, there is a variation in the
exponents of the power-laws followed by personalized
PageRank vectors of different users. However, the
mean of these exponents is almost the same as the
exponent for indegree and PageRank. In our experiment,
the average exponent was 0.77 and the standard
deviation was 0.08.

Remark 3. The number on top of each plot in figure 3
shows the number of friends of the corresponding user.
Notice that there is an initial part in each of these plots which
follows a different power-law than the bulk of the vector.
This is mainly due to the direct friends of the user getting
a lot of weight. But, this is no problem in our applications,
because for instance a recommendation system will not
recommend the friends of the user anyway. In other words, we
don't particularly care about the initial part of the vector.



Figure 3: Personalized PageRank power-laws for
6 random users; x-axis is index $i$ and y-axis is the $i$-th
largest personalized PageRank. Both axes are log-scale.

Figure 4: Sorted power-law exponents for 100 random
users.



Remark 4. For the calculation of the power-law exponent
of the personalized PageRank of each user (in figure 4),
we only considered the part of the vector indexed between
$[2f, 20f]$, where $f$ is the number of friends of the user. Again,
the reason we did this was that this is really the only part of
the vector which matters to us in our applications.
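
For readers who want to reproduce this kind of measurement, one simple way to estimate an exponent on the index range $[2f, 20f]$ is a least-squares line fit in log-log space. The Python sketch below is an assumption about the fitting procedure, offered only as an illustration; the text above does not specify which estimator was used, and the function name is hypothetical.

```python
import math


def fit_power_law_exponent(sorted_scores, f):
    """Estimate alpha from scores sorted in decreasing order.

    Fits log(score) = const - alpha * log(rank) by ordinary least squares,
    using only ranks 2f..20f (1-indexed). All scores must be positive.
    """
    lo, hi = 2 * f, min(20 * f, len(sorted_scores))
    xs = [math.log(j) for j in range(lo, hi + 1)]
    ys = [math.log(sorted_scores[j - 1]) for j in range(lo, hi + 1)]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var


if __name__ == "__main__":
    # Synthetic vector that follows an exact power-law with exponent 0.77.
    scores = [j ** -0.77 for j in range(1, 1001)]
    print(fit_power_law_exponent(scores, f=25))
```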

4.4 A Few Random Steps Go a Long Way 

As we showed in section 3.2, a relatively small number
of random walk steps (e.g., as calculated in equation 4) are
enough to approximately find the top $k$ personalized PageRank
nodes. We experimentally tested this idea. To do so,
notice that the stationary distribution of a random walk is
the limit of the empirical walk distribution as the length of
the walk approaches infinity. Thus, to accurately find the top
k nodes, we can theoretically do an infinite walk, and find 
the top k most frequently visited nodes. Based on this idea, 
we did the following experiment: For each of the 100 ran- 
domly selected users, we did a random walk, personalized 



Figure 5: 11 point interpolated average precision for 
top 1000 results. 



over that user, with 50000 steps. We calculated the top 
100 most visited nodes, and considered them as the "true" 
top 100 results. Then, we did a 5000 step random walk for 
each user, and retrieved the top 1000 most visited nodes. 
For both experiments, we excluded nodes that were directly 
connected to the user. Then, we calculated the 11 point in- 
terpolated average precision curve [34]. The result is given 
in figure 5. Notice that the curve validates our approach.
For instance, the precision at recall level 0.8 is almost equal 
to 0.8, meaning that 80 of the top 100 "true" results were 
returned among the top 100 results of the short (5000 step) 
random walks. Similarly, precision of almost 0.9 at recall 
level 0.7 means 70 of the top 100 "true" results were re- 
trieved among the top 77 results. This shows that even 
short random walks are good enough to find the top scoring 
nodes (in the personalized setting). 
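
For reference, an 11 point interpolated precision curve of the kind used above can be computed as in the following Python sketch, where `ranked` is the list returned by the short walk and `relevant` is the set of "true" top nodes. The function name is chosen for this sketch, and averaging the per-user curves over all users is left out for brevity.

```python
def eleven_point_interpolated_precision(ranked, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    The interpolated precision at recall level r is the highest precision
    observed at any rank whose recall is at least r (0.0 if never reached).
    """
    relevant = set(relevant)
    hits = 0
    points = []  # (relevant items found so far, precision) after each rank
    for rank, node in enumerate(ranked, start=1):
        if node in relevant:
            hits += 1
        points.append((hits, hits / rank))
    curve = []
    for level in range(11):
        # recall >= level / 10, compared in integers to avoid rounding issues
        reachable = [p for h, p in points if 10 * h >= level * len(relevant)]
        curve.append(max(reachable, default=0.0))
    return curve


if __name__ == "__main__":
    print(eleven_point_interpolated_precision(["a", "x", "b"], {"a", "b"}))
```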

4.5 Number of Fetches 

In theorem 8, we gave an upper bound on the number of
fetches needed to compose a random walk out of the stored
walk segments. We did an experiment to test this theoretical
bound. In our experiments we found the average (over 100
users) number of fetches actually done to make a walk of
length $s$, for $s$ between 100 and 50000, when we store $R$ walk
segments per node, for each of the cases with $R \in \{5, 10, 20\}$.
These are the thin lines in the plots in figure 6. We also
calculated the corresponding theoretical upper bound on the
number of fetches for each user (using its own power-law
exponent), and then calculated the average over the 100
users. The results are the thick lines in the plots in figure
6. As can be seen in this figure, our theoretical bounds
actually give an upper bound on the actual number of fetches
in our experiments. Also, we see that the number of fetches
that we make is not very sensitive to the number of stored
random walks per node (i.e., $R$). Note that the theoretical
guarantees are only valid for $R > q\ln n$ for a large enough
constant $q$; hence the theoretical bound appears to be robust
well before the range where we proved it.
APPENDIX

A. EFFECTIVENESS OF RANDOM WALK 
BASED METHODS FOR LINK PREDIC- 
TION 

As mentioned in the Introduction section, random walk 
based methods have been reported to be very effective for 
the link prediction problem on social networks [31]. We 
also did some experiments to explore this further. These 
are somewhat tangential to the rest of this paper, and by 
no means exhaustive. We present them not as significant 





            HITS    COSINE    PageRank    SALSA
Top 100     0.25    4.93      5.07        6.29
Top 1000    0.86    11.69     12.71       13.58

Table 1: Link Prediction Effectiveness



original research but as an interesting data point for readers 
who are interested in practical aspects of recommendation 
systems. 

We picked 100 random nodes from the Twitter social net- 
work. To select these users, we considered the network for 
two different dates, with 5 weeks of difference. Then, we se- 
lected random users who had a reasonable number of friends 
(between 20 and 30) on the first date, and increased the 
number of their friends by a factor between 50% and 100% 
by the second date. For the second date, we only counted the 
friends who already existed (and were reasonably followed, 
i.e., had at least 10 followers) on the first date (because, 
otherwise there is no way for a collaborative filter or link 
prediction system to find these users). 

The reason we enforced the above criteria was that these 
users are among the typical reasonably active users on the 
network. Also, because they are increasing their friends set, 
they are good targets for a recommendation system. 

For each of the 100 users, we used the network data from 
the first date to generate a personalized list of predicted 
links. We considered four link prediction methods: person- 
alized PageRank, personalized SALSA, personalized HITS, 
and COSINE. We already explained personalized PageRank 
and personalized SALSA in the paper. Personalized HITS 
and COSINE also assign hub and authority scores to each 
node. For personalized HITS, when personalizing over node 
u, these scores are related as follows: 

$$h_v = \epsilon\,\delta_{u,v} + (1-\epsilon) \sum_{\{x \mid (v,x)\in E\}} a_x$$

$$a_x = \sum_{\{v \mid (v,x)\in E\}} h_v$$

For the COSINE method, the hub score $h_v$ is defined as
the cosine similarity of the neighbor sets of $u$ and $v$ (considered
as 0-1 vectors). Then, the authority score, similar to
HITS, is defined by:

$$a_x = \sum_{\{v \mid (v,x)\in E\}} h_v$$

We performed 10 iterations for each method to calculate 
the hub and authority scores. After generating the lists of 
predicted links, we calculated how many of the new friend- 
ships that were made by each user between the two dates 
were captured by the top 100 or top 1000 predictions. Fi- 
nally, we averaged these numbers over the 100 selected users. 
The results are presented in table 1.
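
To illustrate the COSINE scores defined above, here is a minimal Python sketch. It assumes that the "neighbor set" of a node is its set of out-neighbors (the accounts it follows) and that the graph is given as a dictionary from each node to that set; both the representation and the function names are assumptions of this sketch.

```python
import math


def cosine_hub_scores(out_neighbors, u):
    """h_v = cosine similarity between the 0-1 neighbor vectors of u and v."""
    nu = out_neighbors[u]
    scores = {}
    for v, nv in out_neighbors.items():
        if nu and nv:
            scores[v] = len(nu & nv) / math.sqrt(len(nu) * len(nv))
        else:
            scores[v] = 0.0  # an empty neighbor set has no defined similarity
    return scores


def authority_scores(out_neighbors, hub):
    """a_x = sum of h_v over all v that have an edge (v, x)."""
    auth = {}
    for v, nv in out_neighbors.items():
        for x in nv:
            auth[x] = auth.get(x, 0.0) + hub[v]
    return auth


if __name__ == "__main__":
    graph = {"u": {"a", "b"}, "v": {"b", "c"}, "w": {"d"}}
    hub = cosine_hub_scores(graph, "u")
    print(hub)
    print(authority_scores(graph, hub))
```

Candidates for recommendation to `u` would then be ranked by authority score, skipping nodes that `u` already follows.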

Notice that we do not expect these numbers to be large, 
because they are the number of friendships out of the pre- 
diction/recommendation list that the user made without be- 
ing exposed to the recommendations at all. Also, in our 
experiments, each user had only 10-30 new friends, which 
is an upper bound on these numbers. This number would 
presumably be very different (i.e., much larger) if the user 



first received the recommendations and then decided which 
friendships to make. Nonetheless, the relative values of these 
numbers for different algorithms are good indicators of the 
predictive ability of those algorithms, especially when the
differences are as pronounced as in our experiments. As 
we see from table 1, the systems based on random walks 
(i.e., Personalized PageRank and SALSA) perform the best: 
they significantly outperform HITS, and they also do bet- 
ter than the cosine similarity based link prediction system. 
These results are in accordance with the previous literature 
indicating the effectiveness of random walk based methods 
for the link prediction problem [31]. Moreover, it should 
be mentioned that there is also axiomatic support for this 
outcome [2, 6].

B. PROOF OF THEOREM 1 

It is already proved in [3] that $E[\tilde{\pi}_v] = \pi_v$. So, we only
need to prove the sharp concentration result. First, assume
$R = 1$. Fix an arbitrary node $v$. Define $X_u$ to be $\epsilon$ times
the number of visits to $v$ in the walk stored at $u$, $Y_u$ to be
the length of this walk, $W_u = \epsilon Y_u$, and $x_u = E[X_u]$. Then, the
$X_u$'s are independent, $\tilde{\pi}_v = \frac{1}{n}\sum_u X_u$ (hence
$\pi_v = \frac{1}{n}\sum_u x_u$), $0 \le X_u \le W_u$, and $E[W_u] = 1$.
Then, it is easy to see that:

Thus: 

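$$\Pr[\tilde{\pi}_v \ge (1+\delta)\pi_v] \le \frac{E\left[e^{tn\tilde{\pi}_v}\right]}{e^{tn(1+\delta)\pi_v}} \le \frac{e^{-n\pi_v\left(1 - E[e^{tW}]\right)}}{e^{tn(1+\delta)\pi_v}} \le e^{-n\pi_v\delta'}$$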

where $W = \epsilon Y$ is a random variable with $Y$ having geometric
distribution with parameter $\epsilon$, and $\delta'$ is a constant
depending on $\delta$ (and $\epsilon$), found by an optimization over $t$.

Therefore, we see that if $\pi_v = \Omega(\ln n/n)$ (i.e., if $\pi_v$ is
slightly larger than the average PageRank value $1/n$), then
we already get a sharp concentration with $R = 1$ (the analysis
for $\Pr[\tilde{\pi}_v \le (1-\delta)\pi_v]$ is similar, and hence we omit it
here).

Now, assume we have $R$ walk segments stored at each
node, where we do not necessarily have $R = 1$. Then, similar
to the above tail analysis, we get:

$$\Pr[\tilde{\pi}_v \ge (1+\delta)\pi_v] \le e^{-nR\pi_v\delta'}$$

Therefore, choosing $R = \Omega\left(\frac{\ln n}{n\pi_v}\right)$ we get exponentially
decaying tails. Notice that this means even for average values
of $\pi_v$ (i.e., for $\pi_v = \Theta(1/n)$), we have sharp concentration
with $R$ as small as $O(\ln n)$.

This finishes the proof of the theorem. 
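
As an informal illustration of the estimator whose concentration is proved above, the following Python sketch stores $R$ walk segments per node and estimates $\tilde{\pi}_v$ as $\epsilon$ times the number of visits to $v$ divided by $nR$. The graph representation and the choice to end a segment at a node without outgoing edges are assumptions of this sketch.

```python
import random


def estimate_pagerank(out_neighbors, R, eps, rng):
    """Monte Carlo estimate: eps * (visits to v) / (n * R) for every node v.

    out_neighbors maps each node to a list of its out-neighbors. Each node
    starts R walk segments; a segment continues with probability 1 - eps
    after every visited node and stops at its first reset.
    """
    nodes = list(out_neighbors)
    visits = dict.fromkeys(nodes, 0)
    for start in nodes:
        for _ in range(R):
            u = start
            while True:
                visits[u] += 1
                # Assumption: a dangling node also ends the segment.
                if rng.random() < eps or not out_neighbors[u]:
                    break
                u = rng.choice(out_neighbors[u])
    scale = eps / (len(nodes) * R)
    return {v: visits[v] * scale for v in nodes}


if __name__ == "__main__":
    cycle = {0: [1], 1: [2], 2: [3], 3: [0]}
    print(estimate_pagerank(cycle, R=2000, eps=0.2, rng=random.Random(0)))
```

On the symmetric 4-cycle used in the example every node has PageRank 1/4, so the printed estimates should all be close to 0.25.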

C. PROOF OF LEMMA 7 

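If $\tilde{X}_{s,v}$ is the number of times that we visit $v$ when we take
a random walk starting at the stationary distribution, then
by coupling our walk with this stationary walk at the first
reset time $t_0$ and all the steps afterwards, we can see that
$X_{s,v} - \tilde{X}_{s,v} = X_{t_0,v} - \tilde{X}_{t_0,v}$. Since
$\sum_v X_{t_0,v} = \sum_v \tilde{X}_{t_0,v} = t_0$, we get:

$$\sum_v \left| E[X_{s,v}] - s\pi_v \right| = \sum_v \left| E[X_{s,v}] - E[\tilde{X}_{s,v}] \right| = \sum_v \left| E[X_{t_0,v}] - E[\tilde{X}_{t_0,v}] \right| \le 2E[t_0] = \frac{2}{\epsilon}$$

which proves the lemma.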



